ViGAT: Bottom-Up Event Recognition and Explanation in Video Using Factorized Graph Attention Network

Abstract

In this paper, a pure-attention bottom-up approach called ViGAT is proposed. It uses an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to capture effectively both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets (FCVID, MiniKinetics, ActivityNet).
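To make the weighted in-degree idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that entry `adjacency[i][j]` holds the attention weight of the edge from node `i` to node `j`, so that the WiD of node `j` is the sum of column `j`. The function names `weighted_in_degrees` and `top_k_salient` and the toy matrix are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def weighted_in_degrees(adjacency: Sequence[Sequence[float]]) -> List[float]:
    """Return the weighted in-degree (WiD) of every node.

    Assumption: adjacency[i][j] is the weight of the edge i -> j,
    so the WiD of node j is the sum over column j.
    """
    n = len(adjacency)
    if any(len(row) != n for row in adjacency):
        raise ValueError("adjacency matrix must be square")
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def top_k_salient(adjacency: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k nodes with the largest WiD."""
    wids = weighted_in_degrees(adjacency)
    ranked = sorted(range(len(wids)), key=lambda j: wids[j], reverse=True)
    return ranked[:k]


# Toy attention matrix over three frames; each row sums to 1,
# as it would after a row-wise softmax.
A = [
    [0.25, 0.50, 0.25],
    [0.25, 0.50, 0.25],
    [0.00, 0.75, 0.25],
]

wids = weighted_in_degrees(A)
salient = top_k_salient(A, 2)
```

In this toy case the second frame receives the most attention from the other frames, so it ranks first. Whether WiDs are taken per block or combined across blocks is a detail of the paper that this sketch does not attempt to reproduce.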

Similar resources

A Bottom-Up Spatiotemporal Visual Attention Model for Video Analysis

A video analysis framework based on spatiotemporal saliency calculation is presented. We propose a novel scheme for generating saliency in video sequences by taking into account both the spatial extent and dynamic evolution of regions. Towards this goal we extend a common image-oriented computational model of saliency-based visual attention to handle spatiotemporal analysis of video in a volume...

Is Bottom-Up Attention Useful for Scene Recognition?

The human visual system employs a selective attention mechanism to understand the visual world in an efficient manner. In this paper, we show how computational models of this mechanism can be exploited for the computer vision application of scene recognition. First, we consider saliency weighting and saliency pruning, and provide a comparison of the performance of different attention models in ...

Event-Related Potentials of Bottom-Up and Top-Down Processing of Emotional Faces

Introduction: Emotional stimulus is processed automatically in a bottom-up way or can be processed voluntarily in a top-down way. Imaging studies have indicated that bottom-up and top-down processing are mediated through different neural systems. However, temporal differentiation of top-down versus bottom-up processing of facial emotional expressions has remained to be clarified. The present st...

Investigating bottom-up auditory attention

Bottom-up attention is a sensory-driven selection mechanism that directs perception toward a subset of the stimulus that is considered salient, or attention-grabbing. Most studies of bottom-up auditory attention have adapted frameworks similar to visual attention models whereby local or global "contrast" is a central concept in defining salient elements in a scene. In the current study, we take...

Models of Bottom-Up Attention and Saliency

Visually conspicuous, or so-called salient, stimuli often have the capability of attracting focal visual attention towards their locations. Several computational architectures subserving this bottom-up, stimulus-driven, spatiotemporal deployment of attention are reviewed in this article. The resulting computational models have applications not only to the prediction of visual search psychophysi...

Journal

Journal title: IEEE Access

Year: 2022

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2022.3213652